Advanced models

1 Getting started

1.1 Required packages

library(highcharter) # For interactive plots
library(WDI)         # For World Bank data
library(RcppRoll)    # Fast rolling functions written in C++
library(openxlsx)    # A cool package to deal with Excel files/formats
library(quantmod)    # Package for financial data extraction
library(tsibble)     # Data-frame framework for time series
library(fable)       # Package for time-series models & predictions
library(feasts)      # Package for time-series analysis
library(forecast)    # Another package for TS (forecasting)
library(tseries)     # Simple package for time-series tests
library(plotly)      # For 3D graphs
library(mvtnorm)     # For multivariate Gaussian distributions
library(crypto2)     # For crypto data
library(ggsci)       # For color palettes
library(rugarch)     # For GARCH models
library(fGarch)      # For GARCH models
library(keras)       # For neural networks
library(vars)        # For VAR (Vector AutoRegressive) models
library(tidyverse)   # THE library for data science; at the end to override other functions
set.seed(42)         # Random seed

impute <- function(v, n = 6){       # Imputation by forward-filling (n passes)
  for(j in 1:n){
    ind <- which(is.na(v))          # Locate missing values
    if(length(ind) > 0){
      if(ind[1] == 1){ind <- ind[-1]} # A leading NA has no predecessor: skip it
      v[ind] <- v[ind-1]            # Copy the previous value
    }
  }
  return(v)
}
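A quick sanity check of the forward-fill behaviour (the function is re-defined so the snippet is self-contained):

```r
# Forward-fill imputation, as defined above
impute <- function(v, n = 6){
  for(j in 1:n){
    ind <- which(is.na(v))
    if(length(ind) > 0){
      if(ind[1] == 1){ind <- ind[-1]}  # cannot fill a leading NA
      v[ind] <- v[ind - 1]
    }
  }
  return(v)
}
impute(c(1, NA, NA, 4))  # consecutive NAs are filled in successive passes: 1 1 1 4
```

Note that a run of consecutive NAs requires one pass per missing value, which is why the function loops n times.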

1.2 Fetching some (crypto) data

coins <- crypto_list() 
c_info <- crypto_info(coin_list = coins, limit = 30)
❯ Scraping crypto info
❯ Processing crypto info
coin_names <- c("Bitcoin", "Ethereum", "Tether USDt", "BNB", "USDC", "Solana")
coin_hist <- crypto_history(coins |> dplyr::filter(name %in% coin_names),
                            start_date = "20170101",
                            end_date = "20251206")
❯ Scraping historical crypto data
❯ Processing historical crypto data

Coin USDC does not have data available! Cont to next coin.

Coin Solana does not have data available! Cont to next coin.
coin_hist <- coin_hist |>  # Timestamps are at midnight, hence we add a day.
  dplyr::mutate(date = 1 + as.Date(as.POSIXct(timestamp, origin="1970-01-01")))
coin_hist <- coin_hist |> 
  group_by(name) |>
  mutate(return = close / lag(close) - 1 ) |>
  ungroup()

2 ARCH-type models

2.1 Realized volatility

Next, let’s have a look at volatility. It’s measured as

\[v_t=\sqrt{\frac{1}{T-1}\sum_{s=1}^T(r_{t-s}-\bar{r})^2},\] where \(\bar{r}\) is the mean return in the sample of \(T\) observations. Below, we use an optimized, C++-based function to compute it as a rolling standard deviation over 20 days.
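To make the formula concrete, here is a plain base-R version of the rolling computation (a slow sketch on simulated returns; the chunk below uses the much faster RcppRoll::roll_sd):

```r
# Rolling realized volatility, directly from the formula above
roll_vol <- function(r, n = 20) {
  sapply(seq_along(r), function(t) {
    if (t < n) NA_real_            # not enough history yet
    else sd(r[(t - n + 1):t])      # sqrt( 1/(n-1) * sum((r_s - mean)^2) )
  })
}
set.seed(42)
r <- rnorm(250, mean = 0, sd = 0.02)  # fake daily returns
v <- roll_vol(r, n = 20)              # first 19 entries are NA
```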

coin_hist <- coin_hist |>
  group_by(name) |>
  na.omit() |>
  mutate(real_vol_20 = roll_sd(return, n = 20, fill = NA, na.rm = T))
coin_hist |>
  filter(name == "Bitcoin") |>
  pivot_longer(all_of(c("close", "real_vol_20")), names_to = "series", values_to = "value") |>
  ggplot(aes(x = date, y = value, color = series)) + geom_line() + theme_bw() +
  facet_wrap(vars(series), nrow = 2, scales = "free")# + scale_y_log10()
Warning: Removed 19 rows containing missing values or values outside the scale range
(`geom_line()`).

Now let’s have a look at the autocorrelation of the volatility: it’s highly persistent.

crypto_acf <- coin_hist |> 
  na.omit() %>%
  split(.$name) %>% 
  map(~acf(.$real_vol_20, plot = F)) %>% 
  map_dfr(
    ~data.frame(lag = .$lag, 
                acf = .$acf,
                ci = qnorm(0.975)/sqrt(.$n.used)
    ), .id = "name") 
crypto_acf |> 
  filter(lag > 0, lag < 15) |>
  ggplot(aes(x = lag, y = acf, fill = name)) + 
  geom_col(position = "dodge", alpha = 0.8) + theme_bw() +
  scale_fill_viridis_d(option = "magma", end = 0.9, begin = 0.1)

Hence, in practice, simple autoregressive models can be too limited, because some variables, like volatility here, are highly persistent and not exactly stationary.

The realized volatility is based on past returns. But markets (equity mostly) have another very cool indicator called the VIX. The VIX is slightly different, because it is forward-looking: it is based on option prices (taking their implied volatilities over specific horizons). Still, it is empirically linked to the traditional realized volatility.

Let’s have a look below.

min_date <- "1998-01-01"                 # Starting date
max_date <- "2025-11-07"                 # Ending date
getSymbols("^VIX", src = 'yahoo',        # The data comes from Yahoo Finance
           from = min_date,              # Start date
           to = max_date,                # End date
           auto.assign = TRUE, 
           warnings = FALSE)
[1] "VIX"
VIX <- VIX |> as.data.frame() |>
  rownames_to_column("date") |>
  dplyr::mutate(date = as.Date(date))
head(VIX,7)                                   # Have a look at the result!
        date VIX.Open VIX.High VIX.Low VIX.Close VIX.Volume VIX.Adjusted
1 1998-01-02    24.34    24.93   23.42     23.42          0        23.42
2 1998-01-05    24.11    25.02   23.02     24.36          0        24.36
3 1998-01-06    25.20    25.97   24.87     25.66          0        25.66
4 1998-01-07    26.15    27.43   25.07     25.07          0        25.07
5 1998-01-08    26.24    26.70   25.62     26.01          0        26.01
6 1998-01-09    25.79    29.35   25.69     28.69          0        28.69
7 1998-01-12    28.69    31.08   28.02     28.02          0        28.02
vix_chart <- VIX |> dplyr::select(VIX.Close)
rownames(vix_chart) <- VIX |> dplyr::pull(date)
highchart(type = "stock") %>%
    hc_title(text = "Evolution of the VIX") %>%
    hc_add_series(as.xts(vix_chart)) %>%
    hc_tooltip(pointFormat = '{point.y:.3f}') 

We clearly see that the series exhibits some moments of extreme activity and its distribution is not the same through time. This is referred to as “clusters of volatility”.

2.2 Some theory

Therefore, we also need to introduce models that allow for this type of feature. In 1982, Robert Engle proposed such a model, called ARCH for “AutoRegressive Conditional Heteroskedasticity”, which was generalized in 1986 to GARCH models. Engle obtained the Nobel Prize in economics in part for this contribution.

As we show below, GARCH models have a direct link with auto-regressive processes!

The idea is to work with the innovations \(e_t\) and to specify them a bit differently, namely as \(e_t=\sigma_t z_t\), where \(\sigma_t\) is a time-varying standard deviation and \(z_t\) is a simple white noise. What matters is that the volatility term is modelled as \[\sigma^2_t = a+\sum_{j=1}^pd_j\sigma^2_{t-j} + \sum_{k=1}^qb_ke^2_{t-k},\] hence, the value of \(\sigma_t^2\) depends both on its own past and on the past squared innovations.
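The recursion is easy to simulate. The sketch below generates a GARCH(1,1) path in base R; the parameters a, alpha and beta are made up (the process is stationary since alpha + beta < 1):

```r
set.seed(42)
n <- 1000
a <- 1e-5; alpha <- 0.1; beta <- 0.85   # hypothetical parameters
sigma2 <- e <- numeric(n)
sigma2[1] <- a / (1 - alpha - beta)     # start at the unconditional variance
e[1] <- sqrt(sigma2[1]) * rnorm(1)      # e_t = sigma_t * z_t
for (t in 2:n) {
  sigma2[t] <- a + alpha * e[t - 1]^2 + beta * sigma2[t - 1]
  e[t] <- sqrt(sigma2[t]) * rnorm(1)
}
```

Plotting e reveals exactly the clusters of volatility observed in the VIX series above.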

There have been MANY extensions to these models: T-GARCH, E-GARCH, etc. See for instance GARCH models: structure, statistical inference and financial applications.

2.3 Examples: estimation

There are many packages for GARCH estimation. We propose two: {rugarch} and {fGarch}.

In {fGarch}, the GARCH(1,1) model is:

\[\sigma_t^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2.\] In the output, some additional parameters allow for a more general model:

  • shape is the shape parameter of the conditional distribution.
  • skew is the skewness parameter of the conditional distribution.
  • delta is the exponent \(\delta\) of the variance recursion, usually fixed at 2.

\(\delta\) values other than 2 come from P-GARCH models:

\[\begin{align} \varepsilon_t & = \sigma_t z_t, \qquad z_t \overset{\text{i.i.d.}}{\sim} \text{Skew}(\text{shape},\text{skew}), \\ \sigma_t^{\delta} &= \omega + \sum_{i=1}^{q} \alpha_i\bigl(|\varepsilon_{t-i}| - \gamma_i \varepsilon_{t-i}\bigr)^{\delta} + \sum_{j=1}^{p} \beta_j\,\sigma_{t-j}^{\delta}, \end{align}\]

The distribution is the Skewed Generalized Error Distribution, which allows for more flexibility than the simple Gaussian law. It nests several simpler distributions, like the Gaussian and Laplace laws for instance.

fit_f <- garchFit(formula = ~garch(1,1), 
         data = coin_hist |> 
           filter(name == "Bitcoin") |> 
           pull(return), 
         trace = F) 
fit_f

Title:
 GARCH Modelling 

Call:
 garchFit(formula = ~garch(1, 1), data = pull(filter(coin_hist, 
    name == "Bitcoin"), return), trace = F) 

Mean and Variance Equation:
 data ~ garch(1, 1)
<environment: 0x152dd6ac0>
 [data = pull(filter(coin_hist, name == "Bitcoin"), return)]

Conditional Distribution:
 norm 

Coefficient(s):
        mu       omega      alpha1       beta1  
1.2790e-03  8.8107e-05  1.0231e-01  8.3473e-01  

Std. Errors:
 based on Hessian 

Error Analysis:
        Estimate  Std. Error  t value Pr(>|t|)    
mu     1.279e-03   5.246e-04    2.438   0.0148 *  
omega  8.811e-05   1.087e-05    8.108 4.44e-16 ***
alpha1 1.023e-01   1.184e-02    8.643  < 2e-16 ***
beta1  8.347e-01   1.577e-02   52.934  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log Likelihood:
 7086.565    normalized:  1.972874 

Description:
 Thu Dec  4 19:32:26 2025 by user:  

Now, with {rugarch}:

spec <- ugarchspec()
fit_r <- ugarchfit(data = coin_hist |> 
                     filter(name == "Bitcoin") |> 
                     na.omit() |>
                     pull(return), 
                   spec = spec)
fit_r@fit$coef
           mu           ar1           ma1         omega        alpha1 
 1.271558e-03 -4.037899e-01  3.681241e-01  8.677615e-05  1.002666e-01 
        beta1 
 8.370735e-01 

The results are not exactly the same! mu and omega are close…
The extra coefficients come from the default ugarchspec(), which fits an ARMA(1,1) model for the conditional mean, hence the ar1 and ma1 terms. The variance recursion is the same as before, \[\sigma^2_t = \omega + \alpha e^2_{t-1} + \beta \sigma^2_{t-1},\] and indeed alpha1 and beta1 are close to the {fGarch} estimates.

With both libraries, it is possible to make predictions.

predict(fit_f, n.ahead = 5)
  meanForecast  meanError standardDeviation
1  0.001279035 0.02777110        0.02777110
2  0.001279035 0.02847428        0.02847428
3  0.001279035 0.02911777        0.02911777
4  0.001279035 0.02970811        0.02970811
5  0.001279035 0.03025082        0.03025082
ugarchforecast(fit_r, n.ahead = 3)

*------------------------------------*
*       GARCH Model Forecast         *
*------------------------------------*
Model: sGARCH
Horizon: 3
Roll Steps: 0
Out of Sample: 0

0-roll forecast [T0=1979-10-14]:
       Series   Sigma
T+1 0.0020578 0.02914
T+2 0.0009541 0.02971
T+3 0.0013998 0.03024

The values are quite close!

A cool feature of {rugarch} is that you can use forward rolling, as in actual backtests (more on that below).

3 Vector Auto-Regression

3.1 The VAR(1) model

Now we are switching to full multivariate mode with the Vector Auto-Regression (VAR). It is written as:

\[X_t = \mu + A X_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. } (0,\Sigma_\varepsilon),\] with \(X_t \in \mathbb{R}^k, \mu \in \mathbb{R}^k, A \in \mathbb{R}^{k \times k}\).

Interpretation

In two dimensions: \(X_t=[X_{t,1} \ X_{t,2} ]'\).

\[X_t = \mu + \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} X_{t-1} + \text{error}\]

So \(A_{11}\) measures the impact of \(X_{t-1,1}\) on \(X_{t,1}\) (itself), and likewise for \(A_{22}\) and the second component. The off-diagonal terms drive the cross-effects: \(A_{12}\) is the impact of \(X_{t-1,2}\) on \(X_{t,1}\), and vice-versa for \(A_{21}\).

A few probabilistic properties.

Mean

If the VAR(1) is stable (all eigenvalues of \(A\) satisfy \(|\lambda_i(A)|<1\)), the unconditional mean exists and is given by \[\mathbb{E}[X_t] = (I - A)^{-1}\mu.\]

Variance

Next, let \(\Gamma_0 = \mathbb{V}(X_t)\); it must satisfy \[ \Gamma_0 = A \Gamma_0 A' + \Sigma_\varepsilon.\]

Next, define the lag-\(h\) autocovariance matrix: \[\Gamma_h = \text{Cov}(X_t, X_{t-h}) = \mathbb{E}[X_t X_{t-h}'].\]

Using for simplicity \(X_t = A X_{t-1} + \varepsilon_t\) (the constant terms do not affect the covariances): \[ \Gamma_h = \text{Cov}(A X_{t-1} + \varepsilon_t, X_{t-h}) = A \text{Cov}(X_{t-1}, X_{t-h}) + \underbrace{\text{Cov}(\varepsilon_t, X_{t-h})}_{=0} = A \Gamma_{h-1} \]

Autocovariance

By recursion, we obtain the multivariate version of the AR(1): \[\Gamma_h = A^h \Gamma_0, \quad h \ge 0.\] Scaling by the variance, we get the autocorrelation: \(A^h\)!
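These properties are easy to verify numerically with a made-up stable \(A\) and \(\Sigma_\varepsilon\), using the vectorized form \(\text{vec}(\Gamma_0) = (I - A \otimes A)^{-1}\text{vec}(\Sigma_\varepsilon)\):

```r
A     <- matrix(c(0.5, 0.1, 0.2, 0.3), 2, 2)   # hypothetical, stable matrix
Sigma <- diag(2)                               # hypothetical error covariance
mu    <- c(0.1, 0.2)
EX <- solve(diag(2) - A, mu)                   # unconditional mean (I - A)^{-1} mu
vecG0  <- solve(diag(4) - kronecker(A, A), as.vector(Sigma))
Gamma0 <- matrix(vecG0, 2, 2)                  # solves Gamma0 = A Gamma0 A' + Sigma
Gamma2 <- A %*% A %*% Gamma0                   # Gamma_h = A^h Gamma_0, here h = 2
```

The vec trick relies on the identity \(\text{vec}(A\Gamma_0 A') = (A \otimes A)\text{vec}(\Gamma_0)\).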

3.2 The VAR(p) Model

A Vector Autoregression of order \(p\), denoted VAR(\(p\)), models a \(k\)-dimensional time series vector \(\mathbf{X}_t = (X_{1,t}, \dots, X_{k,t})^\top\) as:

\[ \mathbf{X}_t = \mathbf{c} + A_1 \mathbf{X}_{t-1} + A_2 \mathbf{X}_{t-2} + \cdots + A_p \mathbf{X}_{t-p} + \mathbf{e}_t, \]

where:

  • \(\mathbf{X}_t\) is a \(k \times 1\) vector of endogenous variables,
  • \(\mathbf{c}\) is a \(k \times 1\) vector of intercept terms,
  • \(A_i\) are \(k \times k\) coefficient matrices,
  • \(\mathbf{e}_t\) is a \(k \times 1\) vector of innovations (reduced-form errors).

Error term assumptions: the innovations satisfy

\[\mathbf{e}_t \sim (0, \Sigma_e),\]

where \(\Sigma_e\) is a symmetric positive definite \(k \times k\) covariance matrix, allowing contemporaneous correlation across equations.
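Contemporaneous correlation means the components of \(\mathbf{e}_t\) move together within the same period. A standard way to draw such errors, sketched here with a made-up \(\Sigma_e\), uses a Cholesky factor:

```r
Sigma_e <- matrix(c(1.0, 0.5,
                    0.5, 2.0), 2, 2)     # hypothetical covariance matrix
L <- t(chol(Sigma_e))                    # lower-triangular L with L %*% t(L) = Sigma_e
set.seed(42)
e <- t(L %*% matrix(rnorm(2 * 1e5), 2))  # 1e5 correlated draws e_t ~ (0, Sigma_e)
```

With this many draws, cov(e) lands very close to Sigma_e.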

3.3 Empirics

VARs are often used on macroeconomic data. Below, we fetch a sample from the World Bank API.

wb_raw <- WDI(                              # World Bank data
  indicator = c(
    "labor" = "SL.TLF.TOTL.IN",             # Labor force (# individuals)
    "savings_rate" = "NY.GDS.TOTL.ZS",      # Savings rate (% GDP)
    "inflation" = "FP.CPI.TOTL.ZG",         # Inflation rate
    "trade" = "NE.TRD.GNFS.ZS",             # Trade as % of GDP 
    "pop" = "SP.POP.TOTL",                  # Population
    "pop_growth" = "SP.POP.GROW",           # Population growth
    "capital_formation" = "NE.GDI.TOTL.ZS", # Gross capital formation (% GDP)
    "gdp_percap" = "NY.GDP.PCAP.CD",        # GDP per capita
    "RD_percap" = "GB.XPD.RSDV.GD.ZS",      # R&D per capita
    "educ_level" = "SE.SEC.CUAT.LO.ZS",     # % pop reachiing second. educ. level
    "educ_spending" = "SE.XPD.TOTL.GD.ZS",  # Education spending (%GDP)
    "nb_researchers" = "SP.POP.SCIE.RD.P6", # Nb researchers per million inhab.
    "debt" = "GC.DOD.TOTL.GD.ZS",           # Central gov. debt (% of GDP)
    "gdp" = "NY.GDP.MKTP.CD"                # Gross Domestic Product (GDP)
  ), 
  extra = TRUE,
  start = 1960,
  end = 2024) |>
  mutate(across(everything(), as.vector)) |>
  select(-status, -lending, -iso2c, -iso3c) |>  
  filter(region != "Aggregates", income != "Aggregates") |>
  arrange(country, year) |>
  group_by(country) |>
  mutate(across(everything(), impute)) |>
  mutate(gdp_growth = gdp_percap/dplyr::lag(gdp_percap) - 1, .before = "region") |>   
  ungroup() |>
  filter(lastupdated == max(lastupdated)) |>
  arrange(country, year) 

Ok, let’s focus on the US for simplicity. Taking other countries into account would imply a panel VAR.
To lighten the (rather heavy) output, we only focus on 3 variables.

wb_us <- wb_raw |> 
  filter(country == "United States") |>
  select(gdp_growth, inflation, pop_growth) |>
  na.omit()
fit_VAR <- VAR(wb_us, p = 1) 
fit_VAR |> summary()

VAR Estimation Results:
========================= 
Endogenous variables: gdp_growth, inflation, pop_growth 
Deterministic variables: const 
Sample size: 63 
Log Likelihood: 97.776 
Roots of the characteristic polynomial:
0.845  0.75 0.2559
Call:
VAR(y = wb_us, p = 1)


Estimation results for equation gdp_growth: 
=========================================== 
gdp_growth = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const 

              Estimate Std. Error t value Pr(>|t|)   
gdp_growth.l1 0.422576   0.141282   2.991  0.00405 **
inflation.l1  0.001643   0.001513   1.086  0.28192   
pop_growth.l1 0.004779   0.012570   0.380  0.70517   
const         0.020769   0.014248   1.458  0.15025   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.0253 on 59 degrees of freedom
Multiple R-Squared: 0.2914, Adjusted R-squared: 0.2554 
F-statistic: 8.087 on 3 and 59 DF,  p-value: 0.0001344 


Estimation results for equation inflation: 
========================================== 
inflation = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const 

              Estimate Std. Error t value Pr(>|t|)    
gdp_growth.l1 30.60299    8.81182   3.473  0.00097 ***
inflation.l1   0.56629    0.09439   6.000 1.29e-07 ***
pop_growth.l1 -0.83768    0.78398  -1.068  0.28965    
const          0.84518    0.88868   0.951  0.34546    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 1.578 on 59 degrees of freedom
Multiple R-Squared: 0.6864, Adjusted R-squared: 0.6705 
F-statistic: 43.05 on 3 and 59 DF,  p-value: 7.201e-15 


Estimation results for equation pop_growth: 
=========================================== 
pop_growth = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const 

              Estimate Std. Error t value Pr(>|t|)    
gdp_growth.l1 0.838545   0.535011   1.567    0.122    
inflation.l1  0.005373   0.005731   0.938    0.352    
pop_growth.l1 0.862075   0.047600  18.111   <2e-16 ***
const         0.059493   0.053956   1.103    0.275    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.09582 on 59 degrees of freedom
Multiple R-Squared: 0.8503, Adjusted R-squared: 0.8427 
F-statistic: 111.7 on 3 and 59 DF,  p-value: < 2.2e-16 



Covariance matrix of residuals:
           gdp_growth inflation pop_growth
gdp_growth  0.0006403   0.01862 -0.0004703
inflation   0.0186233   2.49093 -0.0238060
pop_growth -0.0004703  -0.02381  0.0091824

Correlation matrix of residuals:
           gdp_growth inflation pop_growth
gdp_growth     1.0000    0.4663    -0.1939
inflation      0.4663    1.0000    -0.1574
pop_growth    -0.1939   -0.1574     1.0000

Then, onto forecasting!

predict(fit_VAR, n.ahead = 3)
$gdp_growth
           fcst        lower      upper         CI
[1,] 0.04827954 -0.001316983 0.09787606 0.04959652
[2,] 0.05065569 -0.004265562 0.10557695 0.05492126
[3,] 0.05195311 -0.004846277 0.10875250 0.05679939

$inflation
         fcst       lower    upper       CI
[1,] 3.000879 -0.09247298 6.094230 3.093352
[2,] 3.223890 -0.97430754 7.422087 4.198197
[3,] 3.435725 -1.34243374 8.213883 4.778159

$pop_growth
          fcst     lower    upper        CI
[1,] 0.9528018 0.7649883 1.140615 0.1878135
[2,] 0.9374872 0.6911543 1.183820 0.2463330
[3,] 0.9274756 0.6415400 1.213411 0.2859356

Another option is the {dynlm} package, which comes with documentation and tutorials for VARs.

4 A primer on prediction

4.1 Training/testing & co

The key idea is backtesting: you have to put yourself in the shoes of someone living in the past, that is, without knowledge of the future. To do this, you need to consider a loop that spans different dates in the past. At each point in time \(t\) you:

  1. extract the data available at time \(t\);
  2. build a predictive model;
  3. make some prediction based on the currently available data;
  4. evaluate the accuracy of the prediction thanks to the (future) realized values.

A simplified diagram below.
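The four steps can be sketched on simulated data (an AR(1) series as a stand-in for real returns; buffer and the model order are arbitrary choices):

```r
set.seed(42)
r <- as.numeric(arima.sim(list(ar = 0.3), n = 300))   # stand-in return series
buffer <- 100                                         # minimum training size
pred <- err <- c()
for (t in buffer:(length(r) - 1)) {
  train <- r[1:t]                                     # 1. data available at time t
  fit   <- arima(train, order = c(1, 0, 0))           # 2. build a predictive model
  p     <- as.numeric(predict(fit, n.ahead = 1)$pred) # 3. one-step-ahead prediction
  pred  <- c(pred, p)
  err   <- c(err, r[t + 1] - p)                       # 4. compare to the realized value
}
```

The expanding window (train <- r[1:t]) guarantees that only information available at time t enters the model.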

4.2 Some code

Even though it’s ugly, we will resort to loops.
Below, we focus on the simple case of a unique series (bitcoin) and a simple model (ARMA), but this can be applied to much more complex situations and models.
In the code below, we

  1. build predictions;
  2. derive the corresponding errors;
  3. implement a trading strategy (we buy if returns are predicted to be positive and short sell otherwise).
btc <- coin_hist |> filter(name == "Bitcoin", wday(date) == 2) |> 
  ungroup() |> dplyr::select(date, close) |>
  mutate(fwd_return = lead(close) / close - 1) |> na.omit()
pred <- 0       # To store the predictions
err <- 0        # To store the error
strat_ret <- 0  # Return of the strategy
buffer <- 100   # Time required to start learning
tc <- 0.001     # Transaction costs
trading_dates <- btc$date[buffer:nrow(btc)]
for(j in 1:(nrow(btc)-buffer+1)){
  if(j%%100 == 0){print(trading_dates[j])}
  train <- btc |> filter(date < trading_dates[j]) # expanding windows
  fit <- arima(train$fwd_return, order = c(1,0,1), include.mean = T)
  pred[j] <- predict(fit, n.ahead = 1)$pred[1] |> as.numeric()
  err[j] <- btc |> filter(date == trading_dates[j]) |> pull(fwd_return) - pred[j]
  strat_ret[j] <- (btc |> filter(date == trading_dates[j]) |> pull(fwd_return)) * sign(pred[j])
  if(j>1){strat_ret[j] <- strat_ret[j] - tc * abs(sign(pred[j])-sign(pred[j-1])) } # Costs
}
[1] "2020-10-26"
[1] "2022-09-26"
[1] "2024-08-26"
[1] "2025-08-18"

Let’s analyze the outcome. First, the errors…

sd(btc$fwd_return)
[1] 0.09755115
sd(err)
[1] 0.08771966

At least the errors have a lower variance than the returns themselves.
Next, onto the strategy.

mean(strat_ret > 0) # This is the hit ratio = accuracy
[1] 0.5157385
mean(strat_ret)     # The average weekly return
[1] 0.007221179
sd(strat_ret)       # The volatility
[1] 0.08760782
(mean(strat_ret)*52 - 0.03)/sqrt(52) / sd(strat_ret) # The Sharpe ratio
[1] 0.5468966

Let’s visualize this…

data.frame(date = trading_dates, return = strat_ret) |>
  mutate(portfolio_value = cumprod(1+return)) |>
  ggplot(aes(x = date, y = portfolio_value)) + geom_line() +
  theme_classic() + theme(axis.title = element_blank())

NOTE: in practice, trading costs are super important.
We have also assumed that it’s possible to short BTC: not so obvious, but short ETFs do exist.

4.3 What really matters (in finance)

Models (and people) often focus on loss (objective) functions.
In finance, we usually try to forecast returns, hence the loss function is the (root-) Mean Squared Error:

\[\text{MSE} = \frac{1}{T} \sum_{t=1}^T (r_t-\tilde{r}_t)^2, \]

where \(r_t\) is the realized return and \(\tilde{r}_t\) is the model return. Assuming predictions are unbiased \((\mathbb{E}[r_t]=\mathbb{E}[\tilde{r}_t])\), then (after some math…)

\[\text{MSE} = \mathbb{E}[(r_t - \tilde{r}_t)^2]=\mathbb{V}[r_t]+\mathbb{V}[\tilde{r}_t]-2\text{Cov}(r_t, \tilde{r}_t)\]
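A quick simulation confirms this decomposition (with sample moments, so equality holds up to small finite-sample terms):

```r
set.seed(42)
n  <- 1e5
r  <- rnorm(n)                              # realized returns
rt <- 0.3 * r + sqrt(1 - 0.3^2) * rnorm(n)  # predictions with the same mean (zero)
mse    <- mean((r - rt)^2)
decomp <- var(r) + var(rt) - 2 * cov(r, rt)
c(mse, decomp)                              # the two numbers nearly coincide
```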

Question: which of these terms matter the most?

  • the variance of realized returns: nothing we can do about that.
  • the variance of predicted returns: this enters the bias-variance tradeoff, but its impact is less clear-cut.
  • the correlation between predicted and realized returns: this is the CRUX!!! This measures the accuracy (very similar to the hit ratio!)

5 Advanced temporal models

Before we can move towards sequence-based neural networks, we must explain how simple, feed-forward NNs work.

5.1 (Feedforward) neural networks

A few images help explain how simple NNs work.

First, the structure. The network takes an input which is then processed via many multiplications, additions and activations.
At the very end, we get the output. It can be a simple number, but could also be more complex (vector, matrix, etc.).

The intermediate activation functions are there to break the linearity. A few examples below:

Next, the training. This phase is intended to determine “good” parameter values. Obviously, the “best” parameters would be ideal, but they are out of reach (the problem is too complex).

The core idea is gradient descent: we seek to minimize a loss function (e.g., the MSE) and this depends on the derivative:

Now, there is a very interesting process called backpropagation that involves lots of computations of derivatives (skipped here):

In the end, the parameters (weights) of the model are locally optimal.
But they depend on the training sample: whether they work well out-of-sample (on testing sets) is another question.
If yes, the model is said to generalize well. That’s the great quest of machine learning.

Note

An important property of NNs is that they are universal approximators: Any continuous function \(f\) can be approximated up to arbitrary precision (on a finite interval) by a single layer perceptron with smooth (bounded continuous) activation function.
How come? You just need to add units to help the perceptron overfit!
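A cheap way to see this without {keras}: freeze random sigmoid units and fit only the output weights by (ridge-regularized) least squares, in the spirit of random-feature models. All parameters below are arbitrary:

```r
set.seed(42)
x <- seq(0, 6, by = 0.01)
H <- 64                                      # number of hidden units
w <- runif(H, -3, 3); b <- runif(H, -6, 6)   # random (frozen) hidden weights
Phi <- 1 / (1 + exp(-(outer(x, w) + matrix(b, length(x), H, byrow = TRUE))))
X <- cbind(1, Phi)                           # intercept + sigmoid features
beta <- solve(crossprod(X) + 1e-8 * diag(ncol(X)), crossprod(X, sin(x)))
yhat <- X %*% beta
mean(abs(yhat - sin(x)))                     # small approximation error
```

Even with only the output layer trained, 64 units are ample to reproduce sin(x) on [0, 6].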

5.2 Recurrent networks

Feed-forward networks have zero memory. How can we bring some into the model?

Problem: vanishing gradients… Hence the need for longer-term memory: LSTM or GRU cells.

Gated recurrent units

The system of equations is the following \[\begin{align*} \tilde{y}_i&=z_i\tilde{y}_{i-1}+ (1-z_i)\tanh \left(\textbf{w}_y'\textbf{x}_i+ b_y+ u_yr_i\tilde{y}_{i-1}\right) \quad \textcolor{red}{\text{output}} \\ z_i &= \text{sig}(\textbf{w}_z'\textbf{x}_i+b_z+u_z\tilde{y}_{i-1}) \quad \textcolor{red}{\text{`update gate'}} \ \in (0,1)\\ r_i &= \text{sig}(\textbf{w}_r'\textbf{x}_i+b_r+u_r\tilde{y}_{i-1}) \quad \textcolor{red}{\text{`reset gate'}} \ \in (0,1) \end{align*}\]
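The equations translate directly into code. Below is a scalar-state version of one GRU step; the weights passed in the example are made-up placeholders:

```r
sig <- function(x) 1 / (1 + exp(-x))   # logistic sigmoid
gru_step <- function(x, y_prev,
                     w_z, b_z, u_z,    # update-gate parameters
                     w_r, b_r, u_r,    # reset-gate parameters
                     w_y, b_y, u_y) {  # output parameters
  z <- sig(sum(w_z * x) + b_z + u_z * y_prev)   # update gate, in (0,1)
  r <- sig(sum(w_r * x) + b_r + u_r * y_prev)   # reset gate, in (0,1)
  z * y_prev + (1 - z) * tanh(sum(w_y * x) + b_y + u_y * r * y_prev)
}
y <- gru_step(x = c(0.5, -0.2), y_prev = 0,
              w_z = c(0.1, 0.1), b_z = 0, u_z = 0.1,
              w_r = c(0.1, 0.1), b_r = 0, u_r = 0.1,
              w_y = c(0.1, 0.1), b_y = 0, u_y = 0.1)
```

The gate z interpolates between keeping the old state and writing a new tanh-squashed candidate, which is what gives the cell its memory.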

Preparing the data should follow this structure:

Sequence 1: [price_day1, price_day2, …, price_day10] → target: price_day11
Sequence 2: [price_day2, price_day3, …, price_day11] → target: price_day12
Sequence 3: [price_day3, price_day4, …, price_day12] → target: price_day13

5.3 Example

First, let’s fetch some new data! And compute returns.

getSymbols("NVDA", from = "2010-01-01")
[1] "NVDA"
prices <- Cl(NVDA)                  # Closing price
nvda <- prices / lag(prices) - 1    # Compute return
nvda <- nvda[-1]                    # Removing missing first element

Next, let’s prepare and format the inputs/outputs.

n <- length(nvda)
train_size <- floor(0.8 * n)
look_back <- 20

# Create sequences
X <- matrix(NA, nrow = n - look_back, ncol = look_back)
y <- numeric(n - look_back)
for (i in 1:(n - look_back)) {
  X[i, ] <- nvda[i:(i + look_back - 1)]
  y[i] <- nvda[i + look_back]
}
head(X[,1:5])
             [,1]         [,2]         [,3]         [,4]        [,5]
[1,]  0.014602526  0.006396570 -0.019597489  0.002161031 -0.01401618
[2,]  0.006396570 -0.019597489  0.002161031 -0.014016185 -0.03389832
[3,] -0.019597489  0.002161031 -0.014016185 -0.033898325  0.01358237
[4,]  0.002161031 -0.014016185 -0.033898325  0.013582371 -0.01563372
[5,] -0.014016185 -0.033898325  0.013582371 -0.015633723 -0.02949520
[6,] -0.033898325  0.013582371 -0.015633723 -0.029495204  0.01870255

Check the structure of features.
Final prep: splitting & shaping

X_train <- X[1:train_size, ]
y_train <- y[1:train_size]
X_test <- X[(train_size + 1):nrow(X), ]
y_test <- y[(train_size + 1):length(y)]

# Reshape for GRU [samples, timesteps, features]
X_train <- array(X_train, dim = c(nrow(X_train), ncol(X_train), 1))
X_test <- array(X_test, dim = c(nrow(X_test), ncol(X_test), 1))

Now: onto building the (simple) model.

model <- keras_model_sequential() %>%
  layer_gru(units = 16, input_shape = c(look_back, 1), return_sequences = FALSE) %>%
  layer_dense(units = 1, activation = "tanh")

model %>% compile(optimizer = optimizer_adam(learning_rate = 0.001), loss = "mse")

Training

fit_ <- model %>% fit(X_train, y_train, epochs = 60, batch_size = 32, verbose = 0)
plot(fit_)

We indeed see some improvement on the training set. It slows down after ~20 epochs.

Predict & evaluate.

y_pred <- model %>% predict(X_test, verbose = 0)

mean(abs(y_test - y_pred)) # MAE
[1] 0.02314833
mean(y_test * y_pred > 0)  # Hit ratio
[1] 0.5441741
cor(y_test, y_pred)
           [,1]
[1,] 0.07798228
data.frame(pred = y_pred, actual = y_test, t = 1:length(y_test)) |>
  pivot_longer(-t, values_to = "value", names_to = "type") |>
  ggplot(aes(x = t, y = value, color = type)) + geom_line() +
  theme_minimal()

Clearly the dispersion of the predictions is too low. But the correlation seems positive… which is what matters!

5.4 Code for universal approximation

A first model.

model_ua <- keras_model_sequential()
model_ua %>%   # This defines the structure of the network, i.e. how layers are organized
  layer_dense(units = 16, activation = 'sigmoid', input_shape = 1) %>%
  layer_dense(units = 1) # 
model_ua %>% keras::compile(                      # Model specification
  loss = 'mean_squared_error',                  # Loss function
  optimizer = optimizer_rmsprop(),              # Optimisation method (weight updating)
  metrics = c('mean_absolute_error')            # Output metric
)
summary(model_ua)   
Model: "sequential_1"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
================================================================================
 dense_2 (Dense)                    (None, 16)                      32          
 dense_1 (Dense)                    (None, 1)                       17          
================================================================================
Total params: 49 (196.00 Byte)
Trainable params: 49 (196.00 Byte)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
fit_ua <- model_ua %>% 
  fit(seq(0, 6, by = 0.001) %>% matrix(ncol = 1),              # Training data = x
      sin(seq(0, 6, by = 0.001)) %>% matrix(ncol = 1),         # Training label = y
      verbose = 0,
      epochs = 30, batch_size = 64                             # Training parameters
  ) 

A second model.

model_ua2 <- keras_model_sequential()
model_ua2 %>%   # This defines the structure of the network, i.e. how layers are organized
  layer_dense(units = 128, activation = 'sigmoid', input_shape = 1) %>%
  layer_dense(units = 1) # 
model_ua2 %>% keras::compile(                     # Model specification
  loss = 'mean_squared_error',                  # Loss function
  optimizer = optimizer_rmsprop(),              # Optimisation method (weight updating)
  metrics = c('mean_absolute_error')            # Output metric
)
summary(model_ua2)  
Model: "sequential_2"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
================================================================================
 dense_4 (Dense)                    (None, 128)                     256         
 dense_3 (Dense)                    (None, 1)                       129         
================================================================================
Total params: 385 (1.50 KB)
Trainable params: 385 (1.50 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
fit_ua2 <- model_ua2 %>% 
  fit(seq(0, 6, by = 0.0002) %>% matrix(ncol = 1),            # Training data = x
      sin(seq(0, 6, by = 0.0002)) %>% matrix(ncol = 1),       # Training label = y
      verbose = 0,
      epochs = 60, batch_size = 64                            # Training parameters
  ) 

The comparisons.

tibble(x = seq(0, 6, by = 0.001)) %>%
  ggplot() + 
  geom_line(aes(x = x, y = predict(model_ua, x), color = "Small model")) +
  geom_line(aes(x = x, y = predict(model_ua2, x), color = "Large model")) +
  stat_function(fun = sin, aes(color = "sin(x) function")) + 
  scale_color_manual(values = c("#EEAA33", "#3366CC", "#000000")) + theme_bw()
188/188 - 0s - 92ms/epoch - 491us/step
188/188 - 0s - 90ms/epoch - 476us/step